Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

Authors

Milos Radovanovic

Alexandros Nanopoulos

Mirjana Ivanovic

Abstract

Different aspects of the curse of dimensionality are known to present serious challenges to various machine-learning methods and tasks. This paper explores a new aspect of the dimensionality curse, referred to as hubness, that affects the distribution of k-occurrences: the number of times a point appears among the k nearest neighbors of other points in a data set. Through theoretical and empirical analysis involving synthetic and real data sets we show that under commonly used assumptions this distribution becomes considerably skewed as dimensionality increases, causing the emergence of hubs, that is, points with very high k-occurrences which effectively represent “popular” nearest neighbors. We examine the origins of this phenomenon, showing that it is an inherent property of data distributions in high-dimensional vector space, discuss its interaction with dimensionality reduction, and explore its influence on a wide range of machine-learning tasks directly or indirectly based on measuring distances, belonging to supervised, semi-supervised, and unsupervised learning families.
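The k-occurrence count defined in the abstract can be illustrated with a small experiment. The sketch below is not the authors' code: it assumes i.i.d. standard normal data, Euclidean distance, and illustrative values for `n`, `k`, and the two dimensionalities. The helper names `k_occurrences` and `skewness` are hypothetical. It counts how often each point appears in the k-nearest-neighbor lists of the other points and reports the skewness of those counts.

```python
import numpy as np

def k_occurrences(X, k):
    """Count how often each row of X is among the k nearest neighbors of other rows."""
    n = X.shape[0]
    # Squared Euclidean distances via the Gram-matrix identity.
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is not counted as its own neighbor
    # Indices of the k smallest distances per row (order among the k is irrelevant).
    knn = np.argpartition(d2, k - 1, axis=1)[:, :k]
    return np.bincount(knn.ravel(), minlength=n)

def skewness(values):
    """Standardized third moment of a sample."""
    v = np.asarray(values, dtype=float)
    c = v - v.mean()
    return np.mean(c ** 3) / np.mean(c ** 2) ** 1.5

# Assumed setup: synthetic Gaussian data, n points, neighborhood size k.
rng = np.random.default_rng(0)
n, k = 1000, 5
results = {}
for d in (3, 100):
    X = rng.standard_normal((n, d))
    counts = k_occurrences(X, k)
    results[d] = {"total": int(counts.sum()),
                  "max": int(counts.max()),
                  "skew": float(skewness(counts))}
    print(f"d={d}: max k-occurrence={results[d]['max']}, "
          f"skewness={results[d]['skew']:.2f}")
```

Every point contributes exactly k entries to the neighbor lists, so the counts always sum to n·k and their mean is k. What changes with dimensionality is how unevenly that total is spread: under these assumptions the higher-dimensional sample should show a larger maximum count and a more strongly right-skewed distribution, which is the hubness effect the abstract describes.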


Similar resources

Local and global scaling reduce hubs in space

‘Hubness’ has recently been identified as a general problem of high-dimensional data spaces, manifesting itself in the emergence of objects, so-called hubs, which tend to be among the k nearest neighbors of a large number of data items. As a consequence, many nearest neighbor relations in the distance space are asymmetric, that is, object y is amongst the nearest neighbors of x but not vice versa...


RNN (Reverse Nearest Neighbour) in Unproven Reserve Based Outlier Discovery

Outlier detection refers to the task of identifying patterns that do not conform to established regular behavior. Outlier detection in high-dimensional data presents various challenges resulting from the “curse of dimensionality”. The current view is that distance concentration, that is, the tendency of distances in high-dimensional data to become indiscernible, makes distance-based methods label all points a...


Class imbalance and the curse of minority hubs

Most machine learning tasks involve learning from high-dimensional data, which is often quite difficult to handle. Hubness is an aspect of the curse of dimensionality that was shown to be highly detrimental to k-nearest neighbor methods in high-dimensional feature spaces. Hubs, very frequent nearest neighbors, emerge as centers of influence within the data and often act as semantic singularitie...


Using Mutual Proximity to Improve Content-Based Audio Similarity

This work introduces Mutual Proximity, an unsupervised method which transforms arbitrary distances to similarities computed from the shared neighborhood of two data points. This reinterpretation aims to correct inconsistencies in the original distance space, like the hub phenomenon. Hubs are objects which appear unwontedly often as nearest neighbors in predominantly high-dimensional spaces. We ...


Classification of Chronic Kidney Disease Patients via k-important Neighbors in High Dimensional Metabolomics Dataset

Background: Chronic kidney disease (CKD), characterized by progressive loss of renal function, is becoming a growing problem in the general population. New analytical technologies such as “omics”-based approaches, including metabolomics, provide a useful platform for biomarker discovery and improvement of CKD management. In metabolomics studies, not only prediction accuracy is ...



Journal:

Journal of Machine Learning Research

Volume 11


Publication date: 2010
